Variant calling and genotyping accuracy of ddRAD-seq: Comparison with 20X WGS in layers

Whole Genome Sequencing (WGS) remains a costly or unsuitable method for routine genotyping of laying hens. Until now, breeding companies have been using or developing SNP chips. Nevertheless, alternatives methods based on sequencing have been developed. Among these, reduced representation sequencing approaches can offer sequencing quality and cost-effectiveness by reducing the genomic regions covered by sequencing. The aim of this study was to evaluate the ability of double digested Restriction site Associated DNA sequencing (ddRAD-seq) to identify and genotype SNPs in laying hens, by comparison with a presumed reliable WGS approach. Firstly, the sensitivity and precision of variant calling and the genotyping reliability of ddRADseq were determined. Next, the SNP Call Rate (CRSNP) and mean depth of sequencing per SNP (DPSNP) were compared between both methods. Finally, the effect of multiple combinations of thresholds for these parameters on genotyping reliability and amount of remaining SNPs in ddRAD-seq was studied. In raw form, the ddRAD-seq identified 349,497 SNPs evenly distributed on the genome with a CRSNP of 0.55, a DPSNP of 11X and a mean genotyping reliability rate per SNP of 80%. Considering genomic regions covered by expected enzymatic fragments (EFs), the sensitivity of the ddRAD-seq was estimated at 32.4% and its precision at 96.4%. The low CRSNP and DPSNP values were explained by the detection of SNPs outside the EFs theoretically generated by the ddRAD-seq protocol. Indeed, SNPs outside the EFs had significantly lower CRSNP (0.25) and DPSNP (1X) values than SNPs within the EFs (0.7 and 17X, resp.). The study demonstrated the relationship between CRSNP, DPSNP, genotyping reliability and the number of SNPs retained, to provide a decision-support tool for defining filtration thresholds. Severe quality control over ddRAD-seq data allowed to retain a minimum of 40% of the SNPs with a CcR of 98%. Then, ddRAD-seq was defined as a suitable method for variant calling and genotyping in layers.


Introduction
The development of Next-Generation Sequencing (NGS) approaches has revolutionized genetic marker discovery and genotyping.Depending on the chosen approach, the balance between the marker density, genotype accuracy, the degree of multiplexing of individuals and the experimental costs may vary.Among these approaches, whole genome sequencing (WGS) allows the simultaneous detection and genotyping of the majority of individual polymorphisms at a chosen sequencing depth [1][2][3].However, it is still challenging mostly because of the sequencing costs per sample.Then, cheaper alternative methods such as low depth WGS have been studied [4,5].Compared to deeper WGS, low depth WGS offers a large panel of single nucleotide polymorphisms (SNPs) per sample while increasing inter-individual variability and lowering genotyping accuracy [2,6].Moreover, sequencing the whole genome of every individual in a population is often unnecessary, as many biological questions requirering genomic markers (population genetics, genomic selection, genetic diversity studies. ..) can be answered using only a subset of genomic regions [6].Alternative approaches as SNP chips can be used to genotype only a subset of SNP [7].But, compared to WGS, the selection of a subset of SNPs distributed equidistantly on the genome with maximal MAF values leads to ascertainment bias, causing issues in the interpretation of genetic diversity in the population [8][9][10].
Restriction site associated DNA sequencing (RAD-seq) is a great alternative to WGS.RADseq are de novo approaches targeting a subset of the genome, thus reducing its complexity and providing a reliable set of markers.Practically, these sequencing methods begin by an enzymatic digestion followed by a filtration step based on the size of the enzymatic fragments (EFs).Then, remaining EFs are amplified by PCR to create a library.The fragments of this library are then sequenced from each end.Depending on sequencing capabilities and fragment size, the central part may not be sequenced (The size filtration step helps to limit this gap).The diversity of restriction enzymes (REs) available and ways to combine them make RAD-seq methods versatile assay tools [6].With a good reference genome, the reads can be mapped, which improves the proportion of markers shared between individuals [6,11,12].
The low proportion of shared markers between individuals is a common drawback of RAD-seq approaches.During DNA digestion, REs recognize a specific motif called a restriction site (RS) to cut the DNA.With ddRAD-seq, two different REs will recognize two different restriction sites.When a restriction site is methylated, an RE sensitive to methylation will not be able to access it, and the fragment will not be created.When a mutation occurs in a restriction site, this may be modified and the fragment will not be created either.A mutation can also be responsible for the creation of a restriction site and therefore the creation of a new fragment.These phenomena usually occur on only one of the 2 chromosomes.So, when the enzymatic fragment is sequenced, only one chromosome will be sequenced.The result will be incomplete genotyping, leading to an interpretation of a homozygous genotype.
Thus, mutations in a RS tends to misestimate genetic diversity within the population [13].PCR bias are also a common bias of RAD-seq approaches leading to a miss estimation of genetic diversity [14].Numerous studies have demonstrated methods to mitigate variant calling and genotyping errors from library preparation to bioinformatics processing of sequencing data [15].
To ensure the lowest possible error rate, all studies systematically apply quality control to variant calling and genotyping data [16].The most common quality control filters are the call rate SNP (CR SNP ) [17,18] and average sequencing depth per SNP (DP SNP ) [13,19,20].These filters increase the chances for a SNP to be detected and that allelic frequencies will be well represented in the population.They also ensure to limit the impact of mutations in the RS and methylations on the final data set [13,21].Other filters, such as the probability that the allele frequency distribution respects the Hardy-Weinberg equilibrium (HWE) or the minor allele frequency (MAF) [11] are commonly used to identify and remove SNPs with a high genotyping error rate.The thresholds chosen for each of these filters are rarely justified in the bibliography [22] and depend greatly on the application of the study [6].Some quality control filter thresholds are almost standardized, while others vary from study to study.However, in the case of sequencing methods based on genome reduction, losing genetic information, or wrongly assuming that all SNPs are correct after quality control, can have significant consequences for the conclusions of a study, depending on the application [15,[22][23][24][25][26].But, because of the infinite number of events that cause genotyping errors, it is impossible to hope to get rid of them entirely [27].If not, the studies recommend at least to quantify them in order to give a confidence interval to their results [26][27][28].
Variant calling and genotyping errors can be quantified through the introduction of replicate sample, by comparing the results of the two sequencings with each other.Variants that are not common to both replicates will be considered erroneous [12,26,28].However, this method cannot identify genotyping errors due to mutations in RS [29] which is a major biases for RAD-seq approaches.Then, the best way to quantify the variant calling and genotyping quality of a sequencing method is to compare it to another more reliable reference method [27,29].The availability of sequencing data reliable enough to be considered as a reference representing the "truth", on the same genomic regions as the RAD-seq method, offers the possibility of calculating the sensitivity and accuracy of the RAD-seq method.Sensitivity corresponds to the ability of the RAD-seq method to detect all the SNPs detected by the reference method, in the genomic regions it covers.Precision, on the other hand, reflects the rate of loci wrongly considered as SNPs by RAD-seq.Sensitivity and accuracy are two indicators of the amount of variant calling and genotyping errors that are rarely found in the literature, and even less so for RAD-seq approaches [3,20] although they have been reported in other studies [30,31].The reliability of sensitivity and precision measurements depends on the quality of the sequencing data used as a reference.In literature, genotypes obtained by a RAD-seq approaches have already been compared to considered "more reliable approaches" such as Sanger sequencing [32], SNP chips [19], or even WGS [29] but never for layers.
Among RAD-seq approaches, double digest Restriction site Associated DNA sequencing (ddRAD-seq) is a method which, thanks to the use of two different RE to digest DNA and a size filtration step for EFs, reduces the inter-individual variability of EFs generated and SNPs detected compared with other RAD-seq methods.It drastically reduces the rate of variant calling and genotyping errors compared with other RAD-seq approaches.The use of two REs also facilitates adapter design and reduces sequencing costs per individual and per base, thus offering the best multiplexing capability compared with other RAD-seq approaches [14,24].Moreover, ddRAD-seq is a sequencing method that can be customized (RE choice, filtering method) to suit the needs of each study [6,24,33].This makes ddRAD-seq a reliable method for variant discovery and genotyping of plants [33][34][35][36][37][38] and animals [39][40][41][42][43]. But, as with other RAD-seq methods, there is no consensus in the literature on the quality control and its filters, according to application, species or protocol features.Furthermore, the thresholds chosen for quality control filters are often not justified or based on other studies.Their real impact on the quality of genotyping data is rarely studied.
But, despite its bias, ddRAD-seq represents a major opportunity for the poultry industry as a small number of markers is sufficient to perform for various applications such as genomewide association studies [44], linkage disequilibrium calculation [45], CNV detection [46], genomic selection [47,48] or new variant discovery.ddRAD-seq is cheaper than deep WGS or HD chips and most of the time more accurate than low depth WGS in animal species.Compared with LD chips, RAD-seq sequencing enables the integration of the whole genome including micro-chromosomes and new markers unavailable on the HD chip.These microchromosomes are not well represented on commercial chips.The markers present on the LD chips come from the Affimetrix HD commercial chip [45], which was designed at the time of the galGal4 version of the reference genome (Nov, 2011).At that time, many chromosomes were not fully sequenced.As a result, several micro-chromosomes do not show any SNPs on the HD chip.
Then, the aim of this study is to assess ddRAD-seq quality of sequencing in terms of variant calling and genotyping, for bi-allelic markers, according to a set of population scale filtering options by comparison with 20X WGS.The 20X WGS was taken as a reference, as an effective coverage of 15X is considered as sufficient to achieve high-quality genotyping for WGS [2], The three parts of this work are (i) to describe and compare the SNP calling and genotyping data between WGS and ddRAD-seq, more precisely, (ii) to estimate sensitivity and precision of ddRADseq SNP calling and finally, (iii) to study the genotype concordance of the common SNP between deep WGS and ddRAD-seq.

Genome scale parameters
With the 20X WGS, 9,219,123 bi-allelic SNPs were detected on chromosomes 1 to 39 and Z and 51,050 on contigs.With ddRAD-seq, 349,497 bi-allelic SNPs were detected on chromosomes 1 to 39 and Z and 712 on contigs.There were 327,364 SNP common to both methods.
In both cases, at least half of the SNPs (60.9% in 20X WGS and 50.8% in ddRAD-seq) were located on the macro-chromosomes (1-5).Both methods obtained similar results regarding the percentage of SNPs detected on the intermediate chromosomes (6-10) with 15.4% and 16.3% for the 20X and ddRAD-seq respectively.On the contrary, 29.7% of the SNPs identified in ddRAD-seq were located on the micro chromosomes (11-39) against only 19.5% of those found in 20X WGS.Finally, 4.2% of the SNPs from 20X WGS and 3.2% of the SNPs from ddRAD-seq were found on the Z sexual chromosome (Table 1, Fig 1).
As shown in the Fig 1, the number of SNPs were similarly distributed on chromosomes in 20X WGS or in ddRAD-seq.The number of SNP identified on each chromosome was significantly correlated with the length of each chromosome in 20X WGS (ρ = 0.99) and in ddRADseq (ρ = 0.98).The mean distance between two adjacent SNPs called was 3,221 bp and 130 bp for ddRAD-seq and 20X WGS respectively.

Sensitivity and precision
In silico estimation of the genomic regions that should be covered by enzymatic fragments (EFs) theoretically generated by the ddRAD-seq protocol was performed as described in the Materials and Methods.So, 860,138 Pst1 restriction sites (RS) and 447,997 Taq1 RS were identified on the reference genome GRCg7b.Moreover, 33,600 Pst1 and 63,564 Taq1 RS were created by mutations in our population when considering every individual.A total of 285,004 EFs between 200 and 500 base pair (bp) should have been theoretically generated and pair-end sequenced on an average of 150 bp according to in silico prediction.In 20X WGS, considered here as the reference sequencing approach, 638,135 of the 9,219,123 SNPs were identified inside the expected EFs.Therefore, we expected a sensitivity of 6.9% for ddRAD-seq at a genome-wide scale and a precision surrounding 100%.But, the real sensitivity of ddRAD-seq at a genomic scale was 3.6% and its precision was 93.7%.It represents a loss of sensitivity of 57.9% compared to what was expected.
Then, inside the expected EFs, 214,495 SNPs were identified by ddRAD-seq and 206,872 SNPs were commonly identified by ddRAD-seq and 20X WGS.Using 20X WGS as the reference approach, the effective sensitivity of ddRAD-seq inside the expected EFs was 32.4% and its precision was 96.2%.It represents an even greater loss of sensitivity (67,6%) compared to what was expected then at a genomic scale.This suggested that, in ddRAD-seq, some SNPs were detected outside the expected EFs.Considering the total number of 349,497 SNPs detected in ddRAD-seq, only 61.5% of them were found inside the expected EFs.

Locations of SNPs outside the EFs
The location of SNPs outside of the expected EFs (i.e., 200 to 500 bp framed by the two enzymatic restriction sites) was investigated as described in Materials and methods.Out of the 134,552 SNPs identified outside the EFs, 18,816 SNPs were found in the 10 bases on each end of these EFs.A total of 47,407 SNPs were located in regions covered by EFs less than 200 bp long but generated by the combination of Taq1 and Pst1.Also, 12 342 SNPs were located in genomic regions covered by EFs cut by Taq1 on both ends and 59 125 by Pst1 on both ends.Additionally, 119,116 SNPs were located in regions corresponding to EFs resulting from the failure of both sizing and selection on RE steps in the ddRAD-seq protocol.Genomic regions concerned by each scenario can overlap with one another.All the effectives of SNPs that could be identified by multiple scenarios overlapping was described in the Fig 2 .Finally, considering each scenario and their overlaps, 4,265 SNPs called out of the EFs were not explained by any of these hypotheses.
For ddRAD-seq, the ratio between SNPs inside and outside the theoretical EFs was not different between the chromosomes except for the sexual chromosome Z, where there was as many SNPs inside and SNPs outside the theoretical EFs (Fig 3).There was no difference of location in specific chromosomic regions between the two sets of SNPs (S1 Fig).

Distribution of CR SNP , DP SNP and MAF values
The CR SNP , on average, was significantly higher in 20X WGS (0.99) than in ddRAD-seq (0.The mean MAF was similar between 20X WGS (0.18) and ddRAD-seq (0.

19). The distribution of MAF values was not different between ddRAD-seq and 20X WGS (S2 Fig).
Individuals mean depth of sequencing (DP ind ) were 16X for 20X WGS and 11X for ddRAD-seq.Individual call rate (CR ind ) were respectively 99.4% for 20X WGS and 55.4% for ddRAD-seq.

ddRAD-seq genotyping reliability
The concordance of ddRAD-seq genotypes with 20X WGS was assessed by comparing, when it was possible, ddRAD-seq genotypes to 20X WGS ones.Genotypes were said comparable https://doi.org/10.1371/journal.pone.0298565.g005Finally, ddRAD-seq SNPs were filtered according to multiple combinations of DP SNP and CR SNP threshold.Afterwards, the mean CcR for remaining SNPs and the percentage of ddRAD-seq SNPs kept after applying these filters were calculated (Fig 6C).This multi-criteria filtering approach (DP SNP , CR SNP ) makes it possible to assess the number of SNPs retained depending on the objectives of genotyping reliability.For example, by filtering ddRAD-seq SNPs according to a threshold of 5X for the DP SNP and a CR SNP of 0.8 in ddRAD-seq, approximately 40-50% of ddRAD-seq SNPs will be retained, and 95.1% of these retained SNP will have a concordant genotype with 20X WGS.With the application of QC filters on ddRAD-seq data (CR SNP and DP SNP ) with the highest thresholds possible, poor-quality SNPs can be eliminated by retaining a minimum of 27% genotyped SNPs for all individuals with 98% of reliable genotypes.

Table 2. Effectives of comparable genotypes (GTs) according to the location of the associated SNPs, inside or outside the expected enzymatic fragments (EFs) theoretically generated by the ddRAD-seq protocol.
The effectives of ddRAD-seq GTs matching (concordant) or not matching (discordant) with 20X WGS genotypes, when compared one by one for an individual, for a SNP.

Inside the expected EFs
Outside

Discussion
The aim of the study was to assess the variant calling and genotyping quality of ddRAD-seq in laying hens, as an alternative to low-density sequencing methods (LD chip or low-depth WGS).Due to the diverse nature of ddRAD-seq protocols (enzyme pairs, size filtration method, and bioinformatics processing), estimating the quality of our variant calling and genotyping data by comparing them to the literature was challenging.Therefore, the results from ddRAD-seq were compared to results from 20X WGS obtained for the same individuals.Prior research has demonstrated that comparing a test sequencing method to a reference method allows for estimating the quality of variant calling and genotyping [29].Additionally, for WGS, an effective coverage of 15X is considered as sufficient to achieve high-quality genotyping [2].With an observed average DP SNP of 16X for the 20X WGS, it was deemed a reliable basis for comparison with ddRAD-seq and representative of reality.
Initially, the study revealed that ddRAD-seq exhibited high precision (93.7%).Literature reports precision rates for RAD-seq studies ranging around 90 to 94% [29].The SNPs were well-distributed across all chromosomes, including micro-chromosomes.This represented an advantage for ddRAD-seq compared to commercial chips that do not provide information on all micro-chromosomes.Also, the MAF distribution showed that unlike the SNP chips, ddRAD-seq isn't affected by an ascertainment bias.The genome-scale sensitivity of ddRADseq at 3.6% was comparable to similar methods found in the literature (0.5 to 5.5% depending on enzyme pairs) [20].
However, at the scale of the theoretically expected enzymatic EFs by the protocol in our population, a significantly lower sensitivity of ddRAD-seq than expected was observed (32.4%).Half of the SNPs detected by ddRAD-seq were located outside the expected EFs, indicating that the sequenced EFs did not correspond to reality.This discrepancy is due to in silico estimations based on the known reference genome in the literature [6,50].To approach the actual digestion results and leveraging data from the 20X WGS, restriction sites created and destroyed by mutations were integrated from the outset.These mutations, documented as significant sources of variability in generated EFs among individuals, were described in the literature.It is also acknowledged that the quality of the reference genome significantly impacts in silico simulation of EFs [37].To mitigate this known bias, the latest version of the reference genome was used [51].Despite these precautions, a substantial difference between the expected REFs, based on protocol descriptions, and reality was observed.Some SNPs were found in genomic regions covered by EFs smaller than 200 bp, despite the protocol's filtration step.Literature describes that for ddRAD-seq, depending on the enzyme pair and size filtration method chosen, a significant difference in size distribution between the expected and sequenced EFs.This holds particularly true when using a 4-base cutter and a 6-base cutter as the enzyme pair, as in our case [50].Therefore, it is not surprising to find some SNPs in genomic regions covered by EFs smaller than 200 bp with ddRAD-seq.
We also noted a small proportion of SNPs (14.0%) located outside the anticipated EFs, within a 10-base proximity of the expected EFs.Drawing from existing literature, we hypothesized that these might have originated from degraded DNA fragments.Specifically, ddRADseq is highly sensitive to DNA quality among RAD-seq methods [12], known to significantly influence enzymatic digestion efficiency and susceptibility to UV light exposure [15].Some EFs may have been cleaved by a restriction site on one end and, despite the use of sticky end sites during adaptor ligation, improperly bound to these adaptors on the other end.Usually, after adaptor linking, EFs proceed through the rest of the protocol for amplification and sequencing.Based on previous observations in the literature, it's plausible that despite the protocol's trimming step, certain EFs smaller than twice the size of a read (~300 bp) might have been sequenced in pair-end, potentially causing partial adaptor contamination [50].
Among the remaining SNPs that didn't align with expected EFs or the previously described scenarios, some were genotyped by ddRAD-seq in genomic regions corresponding to EFs generated by the same restriction enzyme on both ends.This suggests that some EFs weren't appropriately filtered to have two distinct adaptors at each end.According to the protocol, for an EF to be sequenced, it must be produced by the combination of Taq1 and Pst1.The majority of SNPs outside the EFs were situated in regions that corresponded to EFs cut by the same restriction site at both ends and were less than 200 bp long.We thus inferred that, within our ddRAD-seq protocol, the adaptor filtering step might not have been completely efficient.Several studies have also observed this phenomenon [52,53].These studies describe the presence of EFs generated exclusively by the first introduced RE in the protocol but not by the second one.In our case, we observed SNPs located in both Taq1-Taq1 and Pst1-Pst1 EFs with more Pst1-Pst1.Regarding the frequency of Taq1 and Pst1 RS on the reference genome, we hypothesized that the difference between the number of SNPs located in Taq1-Taq1 EFs or Pst1-Pst1 EFs was more due to the larger number of Pst1 RS.It is possible to assess the efficacy of the adaptor filtration step for EFs using methods like qPCR, yet very few studies do so [37,53], and they didn't report the results.Similar to most studies employing ddRAD-seq, no quality control for this step was performed in our study.Therefore, it's conceivable that the retention of a portion of EFs generated by a single restriction enzyme despite the filtration step is a generally acknowledged characteristic of ddRAD-seq.
Nevertheless, the random nature of previously cited bias leading events, among the pool of samples, should lead to average CR SNP and DP SNP values lower than those of the SNPs genotyped in the regions covered by the expected EFs.Globally, this would result in a loss of CR SNP and DP SNP at a genome-wide scale [15].
SNPs outside the expected EFs exhibited lower CR SNP and DP SNP compared to SNPs within the EFs, impacting the overall average values of CR SNP and DP SNP more negatively than anticipated.The sequencing depth intended for ddRAD-seq, originally allocated to specific regions, was spread across a larger area than expected.The theoretical calculation of wrongly assumed that only the theoretical EFs had been sequenced by the protocol.As a result, the average DP SNP value decreased from the expected 45X to an observed 11X.Given that a minimum of 30X is recommended for ddRAD-seq, based on a reference genome, to ensure comprehensive genotyping in all individuals, the decrease in CR SNP for SNPs called by ddRAD-seq was expected [22].As DP SNP and sensitivity are correlated [1,54], this decline in DP SNP is responsible for the low sensitivity observed at the expected EFs.Considering that low CR SNP and DP SNP values can impact genotyping reliability [55], it was hypothesized that genotyping SNPs outside the expected EFs might be less reliable than those within.
Upon comparison, the genotype concordance between ddRAD-seq and 20X WGS at 90% was highly satisfying.SNPs with high rates of erroneous genotypes (CcR) were indeed associated with lower CR SNP and DP SNP values than reliable genotypes.
Having genotype data for both ddRAD-seq and 20X WGS was a significant asset for our study, allowing for a detailed individual-level analysis of ddRAD-seq genotyping by pairwise genotype comparison.Quantifying genotyping errors enabled a thorough examination of the most common quality control filters' impact on genotyping reliability.Many studies apply consensus threshold filters from the literature without quantifying their impact on genotyping reliability, which, depending on the applications, can significantly affect study conclusions [15,26].
Our study described the relationship between genotyping reliability and commonly used quality control filters for ddRAD-seq data (CR SNP and DP SNP ), while measuring their impact on the retained SNP quantity.Fig 6C allows for comparison among different quality control scenarios' impact on genotyping reliability in terms of genotype reliability and SNP quantity.It offers the opportunity to establish decision rules on quality control thresholds tailored to each study's needs.
Applying the most stringent quality control filters on CR SNP and DP SNP still allow for the detection a reasonable number of markers.According to the literature, this marker count is largely sufficient for various applications in laying hens [39] such as genome-wide association studies [44], linkage disequilibrium calculation [45], CNV detection [46] or genomic selection [47].Hence, ddRAD-seq proves to be a reliable tool for laying hen genotyping, offering a superior number of SNPs with reliable genotypes, evenly distributed across the entire genome, making it a compelling alternative to LD chips and LD WGS.SNPs called by ddRAD-seq outside of these regions were identified and their locations were studied.The location of theses SNPs regarding the position of RS on the genome was empirically observed using IGV 2.7.2. Hypothesis about the possible reasons of the identification of SNPs outside expected EFs were made based on these observations and comparisons with the literature.SNPs were then classified according to these hypotheses and counted using bedtools v2.30.0.

SNP calling summary
For both ddRAD-seq and 20X WGS, the number and the distribution of genotyped SNPs along the chromosomes were estimated and compared using R V4.0.4.The correlation between the length of the chromosomes and the number of SNPs found on them was performed with the method of spearman for both methods using R V4.0.4.The average distance between two adjacent SNPs in bp was calculated on the whole genome and for each chromosome.
Then, for both ddRAD-seq and 20X WGS, three filtering parameters at the population scale were computed: (i) the ratio between the number of individuals with a non-missing genotype for a SNP and the total number of individuals, called the SNP call rate (CR SNP ), (ii) the minimum allele frequency (MAF) in the population and (iii) the SNP sequencing depth (DP SNP ) at the population scale.The genotype sequencing depth (DP GT ) corresponds to the number of reads supporting a genotype.DP SNP is the sum of each DP GT divided by the total number of genotyped individuals for this SNP.CR SNP and MAF calculations were performed using Plink V1.9 [60] and DP SNP using VCFtools V0.1.16[61] and R V4.0.4 [62].

ddRAD-seq variant calling sensitivity and precision
The variant calling sensitivity of ddRAD-seq was calculated as the number of SNP commonly called by ddRAD-seq and 20X WGS divided by the total number of SNPs called by 20X WGS.The variant calling precision of ddRAD-seq was calculated as the number of SNP commonly called by 20X WGS and ddRAD-seq divided by the total number of SNPs called by ddRADseq.Sensitivity and precision were also calculated considering only regions covered by expected EFs.

20X WGS and ddRAD-seq genotype concordance
First, the SNPs that were called with both 20X WGS and ddRAD-seq approaches were kept.The number and the repartition of these common SNPs on the chromosomes were analyzed.Then, individually, each genotype was compared between 20X WGS and ddRAD-seq.If one or both methods didn't allow to genotype the individual for a SNP, the genotypes were considered incomparable, and the SNP was excluded for this individual.Concerning the comparable genotypes, they were considered concordant when both alleles were the same between ddRAD-seq and 20X WGS.On the contrary, if one or two alleles were different for a genotype between ddRAD-seq and 20X WGS sequencing methods, they were considered discordant.For the concordant and discordant genotypes, DP GT were compared between ddRAD-seq and 20X WGS.They were obtained using VCFtools V0.1.16.Moreover, the Concordance rate (CcR) was computed as the ratio between the number of concordant genotypes and the number of comparable genotypes for each SNP.The mean CcR was calculated on all the common SNPs.Finally, ddRAD-seq data were filtered according to multiple combinations of Mean DP SNP and SNP CR SNP threshold.The evolution of the mean CcR and the number of retained SNPs were studied under these conditions.

Fig 1 .Fig 2 .Fig 3 .
Fig 1. Percentage of total SNP detected by 20X WGS (in orange) and ddRAD-seq (in blue) on chromosomes 1 to 39 and Z. https://doi.org/10.1371/journal.pone.0298565.g001 55) (t-test p-value < 0,05, Fig 4A).95.2% of the SNPs obtained with the 20X WGS were genotyped for all individuals (CR SNP = 1) while only 29.3% were in ddRAD-seq (Fig 4B).56.8% of the ddRAD-seq SNPs had a CR SNP below 0.8 against only 0.9% of the 20X WGS SNPs.Furthermore, the parabolic distribution of CR SNP values in ddRAD-seq showed two higher points, with a large proportion of SNPs having low CR SNP values (0-0.1) and an equivalent proportion having higher CR SNP values (Fig 4B).Similarly to CR SNP , the average DP SNP in ddRAD-seq was lower (11X) than the average DP SNP observed in 20X WGS (16X) and even lower than the average DP SNP expected (~45X, Fig4C).Theoretically in ddRAD-seq, 285,004 EFs between 200 and 500 bp should be generated and pair-end sequenced on an average of 150 bp.Therefore, we estimated that 85.5 Mb should be covered by ddRAD-seq.In laying hens, the genome size is 1.26 Gb[49] which means that we expect a mean DP SNP close to 45X in ddRAD-seq (450 Gb per Novaseq 6000 flowcell with 120 individuals per flowcell).The distribution of SNPs in DP SNP categories of ddRAD-seq showed two peaks of density: one at low DP SNP values (0-5X) and one at higher values(25)(26)(27)(28)(29)(30) Fig 4D).It was observed that

For
the SNPs inside and outside of the expected EFs, the CR SNP and the DP SNP were calculated.These two parameters were lower for the set of SNPs outside the theoretical EFs then for the SNPs inside the expected EFs (Fig 5).Mean CR SNP was 0.74 for SNPs inside the expected EFs and 0.25 for SNPs outside.The standard deviation of CR SNP was 0.35 for ddRAD-seq SNPs inside and 0.27 for SNPs outside the expected EFs.These values appeared consistent with the peaks observed at a genomic scale (Fig 4B).Mean DP SNP values were respectively 17X and 1X for SNPs inside and SNPs outside the expected EFs (Fig 5).The standard deviation of DP SNP for ddRAD-seq SNPs inside the expected EFs was 15X while it was 3X for SNPs outside.These DP SNP values also corresponded to those observed for each peak at a genomic scale (Fig 4D).The MAF was similar between 20X WGS (0.18) and ddRAD-seq (0.19).The distribution of MAF values was not different between ddRAD-seq and 20X WGS (S2 Fig).Individuals mean depth of sequencing (DP ind ) were 16X for 20X WGS and 11X for ddRAD-seq.Individual call rate (CR ind ) were respectively 99.4% for 20X WGS and 55.4% for ddRAD-seq.

Fig 5 .
Fig 5. Distribution of CR SNP , DP SNP between SNPs in (in blue) and out (light blue) of enzymatic fragments for ddRAD-seq compared to 20X WGS (in orange).

Fig 6 .
Fig 6.Genotyping quality of ddRAD-seq is correlated with the CR SNP and the DP SNP .(A) Distribution of genotypes sequencing depth (DP GT ) for 20X WGS, the reference, (in orange) and concordant (✓) or discordant (✘) genotypes in ddRAD-seq for SNPs in the enzymatic fragments theoretically generated by the ddRAD-seq protocol (in blue) or out the EFs theoretically generated by the ddRAD-seq protocol (in light blue).Percentages of each category of ddRAD-seq genotype on the whole amount of genotype was displayed (grey shades).(B) Distribution of SNPs sequencing depths (DP SNP ) according to their category of genotype concordance rate (CcR) and SNP call rate (CR SNP ).(C) Mean CcR of ddRAD-seq according to the CR SNP and the mean sequencing depth threshold.The blue gradient represents the proportion of ddRAD-seq SNPs kept according to each filter combination.https://doi.org/10.1371/journal.pone.0298565.g006

Table 1 . Number of SNPs called in 20X WGS and in ddRAD-seq and their percentage on the total number of SNPs by chromosome type.
https://doi.org/10.1371/journal.pone.0298565.t001